
fix(approx_fns): use exact percentile when no compression #21388

Open
aryan-212 wants to merge 3 commits into apache:main from aryan-212:approx-percentile-fixes


@aryan-212
Contributor

@aryan-212 aryan-212 commented Apr 5, 2026

Which issue does this PR close?

  • Closes #.

Rationale for this change

DataFusion's approx_percentile_cont / approx_median use a t-digest internally. The t-digest's interpolation step assumes centroids represent clusters of multiple points. But if the number of input rows is small (≤ the digest's max_size / compression threshold), no compression ever happens: every centroid has weight 1 and corresponds to exactly one input value.

In that regime, interpolation is not just unnecessary, it is actively wrong. The t-digest interpolates between adjacent centroids based on where the rank falls inside the centroid's weight, using half-deltas to neighbors. When every centroid has weight 1, this produces values that drift away from any actual data point.

This is particularly surprising for users running small queries or unit tests, who expect percentile functions on a handful of values to return one of those values.

Concrete Example

Let's take a small example from the TPC-DS schema:

select cc_sq_ft from call_center;
none cc_sq_ft
1 6144
2 6144
3 19345
4 21156
5 21156
6 22743
7 34643
8 42935
9 52514
10 65772
11 76815
12 84336
13 105138
14 119886

Now if we take a small APPROX_PERCENTILE query like:

select approx_percentile(cc_sq_ft, 0.85) from call_center limit 50

Here 0.85 × 14 = 11.9, which rounds up to rank 12, so the query should return the 12th value, 84336. That is what the same query returns in Databricks:

[Screenshot: Databricks result for the query above]

But in DataFusion this comes up as:

[Screenshot: DataFusion result for the query above]

This PR aims to fix this.
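The nearest-rank arithmetic in the example above can be sketched in a few lines of Rust. This is only an illustration of the expected result, not DataFusion code: `nearest_rank_percentile` and `cc_sq_ft` are hypothetical helpers, and the data is the `cc_sq_ft` column listed earlier.

```rust
/// Nearest-rank (ceiling) percentile over an already sorted slice.
/// Hypothetical helper for illustration; not part of DataFusion's API.
fn nearest_rank_percentile(sorted: &[i64], q: f64) -> Option<i64> {
    // Reject empty input and any q outside [0, 1] (including NaN).
    if sorted.is_empty() || !(0.0..=1.0).contains(&q) {
        return None;
    }
    // 1-based rank = ceil(q * n); q = 0 is mapped to the first element.
    let rank = (q * sorted.len() as f64).ceil() as usize;
    let index = rank.max(1) - 1;
    Some(sorted[index.min(sorted.len() - 1)])
}

/// The 14 sorted cc_sq_ft values from the call_center example.
fn cc_sq_ft() -> Vec<i64> {
    vec![
        6144, 6144, 19345, 21156, 21156, 22743, 34643, 42935, 52514, 65772,
        76815, 84336, 105138, 119886,
    ]
}
```

Calling `nearest_rank_percentile(&cc_sq_ft(), 0.85)` gives `Some(84336)`: the rank is ceil(11.9) = 12, and the 12th value is 84336.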

What was wrong before

Prior to this change, when no t-digest compression occurred, estimate_quantile still ran the t-digest interpolation path. This produced values that were:

  • Neither exact continuous percentiles (like percentile_cont)
  • Nor exact discrete percentiles (like percentile_approx / Databricks)
  • Just t-digest approximation artifacts on already-exact data

For example, approx_median on the 10-value window frame [-85, -72, -56, -48, -43, -25, -12, -5, 45, 83] returned -32 — not -34 (the true continuous median) and not -43 (the discrete nearest-rank median).
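The three candidate answers for this frame can be compared with a rough sketch. The `continuous` and `nearest_rank` functions are the textbook definitions. `weight_one_interpolation` is a simplified reconstruction of the interpolation described above (unit-weight centroids, half-deltas to neighbors); it is an assumption made for illustration, not a copy of DataFusion's `estimate_quantile`. All three assume a non-empty, sorted slice without NaNs and 0 <= q <= 1.

```rust
/// Continuous percentile: linear interpolation between the two closest ranks.
fn continuous(sorted: &[f64], q: f64) -> f64 {
    let pos = q * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    sorted[lo] + (pos - lo as f64) * (sorted[hi] - sorted[lo])
}

/// Discrete nearest-rank percentile: index = ceil(q * n) - 1.
fn nearest_rank(sorted: &[f64], q: f64) -> f64 {
    let rank = (q * sorted.len() as f64).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

/// Assumed, simplified t-digest-style estimate when every centroid has weight 1.
fn weight_one_interpolation(sorted: &[f64], q: f64) -> f64 {
    let n = sorted.len();
    let rank = q * n as f64;
    // Centroid whose unit-weight interval [pos, pos + 1) contains the rank.
    let pos = (rank.floor() as usize).min(n - 1);
    let prev = sorted[pos.saturating_sub(1)];
    let next = sorted[(pos + 1).min(n - 1)];
    // Half the neighbor-to-neighbor distance for interior centroids,
    // the full distance to the single neighbor at either edge.
    let delta = if pos == 0 || pos + 1 == n {
        next - prev
    } else {
        (next - prev) / 2.0
    };
    // Offset inside the centroid, measured from its midpoint.
    let value = sorted[pos] + (rank - pos as f64 - 0.5) * delta;
    value.clamp(prev, next)
}
```

For the frame `[-85, -72, -56, -48, -43, -25, -12, -5, 45, 83]` at q = 0.5, `continuous` gives -34, `nearest_rank` gives -43, and the sketched interpolation gives -32.75. That last figure would be consistent with the reported -32 if the result is truncated to the integer column type, which is an assumption here.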

What changes are included in this PR?

  1. tdigest.rs: When no compression has occurred (self.count == self.centroids.len()), bypass the t-digest interpolation and use exact_quantile instead. This method uses the nearest-rank (ceiling) method: index = ceil(q * n) - 1, which returns an actual observed data value — matching Databricks' percentile_approx / approx_percentile semantics.

  2. Test expectation updates: Updated snapshot and SQL logic test expectations across:

    • datafusion/core/tests/dataframe/mod.rs: window_using_aggregates snapshot
    • datafusion/sqllogictest/test_files/aggregate.slt: approx_median, approx_percentile_cont, and approx_percentile_cont_with_weight test expectations
    • datafusion/sqllogictest/test_files/aggregate_skip_partial.slt: approx_median with grouping, nulls, and filters
    • datafusion/sqllogictest/test_files/metadata.slt: approx_median(distinct id) on small table
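The bypass described in item 1 might look roughly like the sketch below. `ToyDigest`, `Centroid`, `from_values`, and `is_uncompressed` are hypothetical stand-ins rather than DataFusion's actual types; only the `count == centroids.len()` check and the `ceil(q * n) - 1` index come from the description above. The interpolation path is deliberately left out and returns `None` in this sketch.

```rust
/// Toy stand-in for a t-digest centroid; names are illustrative only.
#[derive(Debug, Clone, Copy)]
struct Centroid {
    mean: f64,
    weight: f64,
}

/// Toy digest: centroids sorted by mean, plus the total number of inputs.
struct ToyDigest {
    centroids: Vec<Centroid>,
    count: u64,
}

impl ToyDigest {
    /// Build an uncompressed digest: one weight-1 centroid per input value.
    fn from_values(mut values: Vec<f64>) -> Self {
        values.sort_by(|a, b| a.total_cmp(b));
        let count = values.len() as u64;
        let centroids = values
            .into_iter()
            .map(|mean| Centroid { mean, weight: 1.0 })
            .collect();
        Self { centroids, count }
    }

    /// No compression has happened if every input still has its own centroid.
    fn is_uncompressed(&self) -> bool {
        self.count == self.centroids.len() as u64
            && self.centroids.iter().all(|c| c.weight == 1.0)
    }

    /// Nearest-rank lookup; the caller must ensure the digest is non-empty.
    fn exact_quantile(&self, q: f64) -> f64 {
        let n = self.centroids.len();
        let rank = (q * n as f64).ceil() as usize;
        self.centroids[rank.clamp(1, n) - 1].mean
    }

    /// Dispatch: exact value when uncompressed, otherwise fall through.
    fn estimate_quantile(&self, q: f64) -> Option<f64> {
        if self.centroids.is_empty() {
            return None;
        }
        if self.is_uncompressed() {
            return Some(self.exact_quantile(q));
        }
        // The regular t-digest interpolation would run here; omitted in this sketch.
        None
    }
}
```

With the 14 `cc_sq_ft` values loaded through `from_values`, `estimate_quantile(0.85)` takes the exact path and returns `Some(84336.0)`. A digest whose centroids have been merged (for example 3 inputs held in 2 centroids) fails the check and falls through to the omitted interpolation path.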

Are these changes tested?

Yes. All existing tests have been updated to reflect the new behavior. The key tests are:

  • window_using_aggregates — window function with approx_median over varying frame sizes
  • aggregate.slt: approx_percentile_cont at various percentiles (0.5, 0.95), including Float16/Float64/decimal types, with and without weights
  • aggregate_skip_partial.slt: approx_median with GROUP BY, nullable columns, and FILTER clauses
  • metadata.slt: approx_median(distinct id) regression test

Are there any user-facing changes?

Yes. approx_percentile_cont, approx_median, and approx_percentile_cont_with_weight will now return exact nearest-rank values (matching Databricks behavior) when the input dataset is small enough that no t-digest compression occurs (fewer than ~100 values per group by default). For larger datasets where compression happens, the existing t-digest approximation behavior is unchanged.

This means approx_median and percentile_cont(0.5) may now return different values for small datasets — this is expected and consistent with how Databricks distinguishes approximate vs exact percentile semantics.

@github-actions github-actions bot added the functions Changes to functions implementation label Apr 5, 2026
@aryan-212 aryan-212 force-pushed the approx-percentile-fixes branch 2 times, most recently from 22718b8 to e997594 Compare April 6, 2026 06:02
@github-actions github-actions bot added the core Core DataFusion crate label Apr 6, 2026
@aryan-212 aryan-212 force-pushed the approx-percentile-fixes branch 2 times, most recently from 40f862d to 95a4eff Compare April 6, 2026 06:26
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Apr 6, 2026
@aryan-212 aryan-212 force-pushed the approx-percentile-fixes branch 2 times, most recently from d8339ff to 4f86249 Compare April 6, 2026 07:02
@aryan-212
Contributor Author

How Databricks treats percentile vs approx_percentile

Databricks draws a clear semantic difference between its two percentile functions:

| Function | Semantics | Behavior |
| --- | --- | --- |
| percentile / percentile_cont | Continuous: interpolates between adjacent values | median([1, 2]) = 1.5 |
| percentile_approx / approx_percentile | Discrete: returns an actual observed value from the dataset | approx_median([1, 2]) = 1 |

This was verified by running the equivalent window query on Databricks against the same 21-row dataset used in DataFusion's window_using_aggregates test. The Databricks output confirmed that percentile_approx picks the nearest-rank value (no interpolation), while percentile interpolates.

@aryan-212 aryan-212 force-pushed the approx-percentile-fixes branch from f0ef5fd to 48e384b Compare April 7, 2026 08:19
@aryan-212 aryan-212 force-pushed the approx-percentile-fixes branch from 48e384b to 57fae2e Compare April 7, 2026 08:19